Proctree improvements (RSS/Performance) #4261

geyslan · 2024-08-22T21:06:20Z

1. Explain what the PR does

4b89620 fix: create a new fresh string for comm
cb89e2f perf: use proctree expirable lru cache
976704f perf: remove mutex, add concurrency tests
36ca022 chore: benchmark proctree

4b89620 fix: create a new fresh string for comm

comm had the same inner address as the original string which it was
created from (basename). This caused the original underlying bytes to
be preserved and not garbage collected.

Converting it to a byte slice and then back to a string guarantees that
a new string is created releasing the original reference.

cb89e2f perf: use proctree expirable lru cache

hashicorp LRU increases RSS over time. This change uses the proctree
lru cache to an expirable one, which will remove entries after a
certain time.

Two flags/options were added to the proctree command to control the
cache TTLs. The default is 120 seconds.

flags:

  --proctree process-cache-ttl=60
  --proctree thread-cache-ttl=60

yaml config:

    cache-ttl:
        process: 60
        thread: 60

976704f perf: remove mutex, add concurrency tests

Public proctree methods already make use of the mutex of the inner lru
cache. This commit removes the outer mutex and adds concurrency tests
to ensure that the proctree is safe to use concurrently.

2. Explain how to test it

Run tracee

sudo ./dist/tracee -e execve --proctree source=both --proctree process-cache-ttl=60 --proctree thread-cache-ttl=60 -o none

Stress it

i=0; while ((i < 300000)); do ls >/dev/null; ((i++)) ; done

Measurements

We consider looking at only the 000000c0000000000 segment, narrowing down to where the proctree allocates memory. Not doing this would give us the full RSS which contains the perf buffer segments that are allocated per cpu.

PROCTREE MAIN
-------------

000000c000000000   53248   50932   50932 rw---   [ anon ] - right after 5s of running
000000c000000000  393216  388064  388064 rw---   [ anon ] - right after stress (stress stopped)
000000c000000000  393216  388072  388072 rw---   [ anon ] - +2min after
000000c000000000  393216  310312  310312 rw---   [ anon ] - +2min after
000000c000000000  393216  310056  310056 rw---   [ anon ] - +2min after

*A reduction of 78008 KB (after 6min) but without further decrease.

NEW PROCTREE (using an expirable cache with 60s TTL)
------------

000000c000000000   53248   50864   50864 rw---   [ anon ] - right after 5s of running
000000c000000000  397312  394204  394204 rw---   [ anon ] - right after stress (stress stopped)
000000c000000000  397312   73280   73280 rw---   [ anon ] - +2min after
000000c000000000  397312   49988   49988 rw---   [ anon ] - +2min after
000000c000000000  397312   48580   48580 rw---   [ anon ] - +2min after

*A reduction of 345624 KB (after stress and cache expiring).

976704f perf: remove mutex, add concurrency tests

That reduced lock contention.

/home/gg/.goenv/versions/1.22.4/bin/go test -benchmem -run=^$ -tags ebpf -bench ^BenchmarkProcessTree$ github.com/aquasecurity/tracee/pkg/proctree -benchtime=1000000x -race

| Benchmark                                |   new   |   old   | % Improv |
|                                          | (ns/op) | (ns/op) |          |
|------------------------------------------|---------|---------|----------|
| GetProcessByHash-Concurrency1-32         |  176.1  |  297.5  | 40.81%   |
| GetProcessByHash-Concurrency2-32         |  583.1  | 1513.0  | 61.46%   |
| GetProcessByHash-Concurrency4-32         | 1103.0  | 2404.0  | 54.12%   |
| GetProcessByHash-Concurrency8-32         | 3222.0  | 3605.0  | 10.62%   |
| GetOrCreateProcessByHash-Concurrency1-32 |  298.9  |  333.4  | 10.36%   |
| GetOrCreateProcessByHash-Concurrency2-32 |  759.7  | 1723.0  | 55.90%   |
| GetOrCreateProcessByHash-Concurrency4-32 | 1343.0  | 3133.0  | 57.14%   |
| GetOrCreateProcessByHash-Concurrency8-32 | 3532.0  | 5780.0  | 38.89%   |
| GetThreadByHash-Concurrency1-32          |  291.1  |  372.4  | 21.83%   |
| GetThreadByHash-Concurrency2-32          |  695.9  | 2630.0  | 73.54%   |
| GetThreadByHash-Concurrency4-32          | 1574.0  | 2503.0  | 37.12%   |
| GetThreadByHash-Concurrency8-32          | 3282.0  | 3301.0  |  0.58%   |
| GetOrCreateThreadByHash-Concurrency1-32  |  300.2  |  336.5  | 10.80%   |
| GetOrCreateThreadByHash-Concurrency2-32  |  837.1  | 2537.0  | 67.01%   |
| GetOrCreateThreadByHash-Concurrency4-32  | 1764.0  | 1704.0  | -3.52%   |
| GetOrCreateThreadByHash-Concurrency8-32  | 3946.0  | 5282.0  | 25.28%   |

Measurements Conclusion

With this PR, after stress and cache expiring, we have a decrease around 267616KB (310056KB [old final cache] - 48580KB [new final cache]) in the size of proctree and significant improvement in CPU time of the public methods of proctree.

3. Other comments

Public proctree methods already make use of the mutex of the inner lru cache. This commit removes the outer mutex and adds concurrency tests to ensure that the proctree is safe to use concurrently.

hashicorp LRU increases RSS over time. This change uses the proctree lru cache to an expirable one, which will remove entries after a certain time. Two flags/options were added to the proctree command to control the cache TTLs. The default is 120 seconds. flags: --proctree process-cache-ttl=60 --proctree thread-cache-ttl=60 yaml config: cache-ttl: process: 60 thread: 60

comm had the same inner address as the original string which it was created from (basename). This caused the original underlying bytes to be preserved and not garbage collected. Converting it to a byte slice and then back to a string guarantees that a new string is created releasing the original reference.

rscampos · 2024-08-23T13:28:41Z

double checked the benchmark in my env, good improvement!

GOOS=linux CC=clang GOARCH=arm64 CGO_CFLAGS="-I/vagrant/dist/libbpf -I/vagrant/3rdparty/libbpf/include/uapi" CGO_LDFLAGS="-L/vagrant/dist/libbpf/obj -lbpf " go test -benchmem -run=^$ -bench ^BenchmarkProcessTree$ github.com/aquasecurity/tracee/pkg/proctree -benchtime=1000000x -race

BenchmarkProcessTree/GetProcessByHash-Concurrency1-4             612.0 ns/op  309.2 ns/op  49.45%
BenchmarkProcessTree/GetProcessByHash-Concurrency2-4            1508 ns/op   947.5 ns/op  37.20%
BenchmarkProcessTree/GetProcessByHash-Concurrency4-4            3133 ns/op   1918 ns/op   38.77%
BenchmarkProcessTree/GetProcessByHash-Concurrency8-4            6332 ns/op   3646 ns/op   42.44%
BenchmarkProcessTree/GetOrCreateProcessByHash-Concurrency1-4     667.8 ns/op  491.9 ns/op  26.35%
BenchmarkProcessTree/GetOrCreateProcessByHash-Concurrency2-4    1694 ns/op   1377 ns/op   18.71%
BenchmarkProcessTree/GetOrCreateProcessByHash-Concurrency4-4    3396 ns/op   2707 ns/op   20.29%
BenchmarkProcessTree/GetOrCreateProcessByHash-Concurrency8-4    6842 ns/op   5419 ns/op   20.83%
BenchmarkProcessTree/GetThreadByHash-Concurrency1-4              667.1 ns/op  499.3 ns/op  25.14%
BenchmarkProcessTree/GetThreadByHash-Concurrency2-4             1673 ns/op   1381 ns/op   17.46%
BenchmarkProcessTree/GetThreadByHash-Concurrency4-4             3405 ns/op   2615 ns/op   23.21%
BenchmarkProcessTree/GetThreadByHash-Concurrency8-4             6844 ns/op   5317 ns/op   22.31%
BenchmarkProcessTree/GetOrCreateThreadByHash-Concurrency1-4      697.7 ns/op  522.8 ns/op  25.09%
BenchmarkProcessTree/GetOrCreateThreadByHash-Concurrency2-4     1694 ns/op   1377 ns/op   18.72%
BenchmarkProcessTree/GetOrCreateThreadByHash-Concurrency4-4     3506 ns/op   2860 ns/op   18.44%
BenchmarkProcessTree/GetOrCreateThreadByHash-Concurrency8-4     6809 ns/op   5406 ns/op   20.60%

rscampos

LGTM, congrats for the good job :)

geyslan added kind/bug area/performance kind/chore milestone/v0.22.0 labels Aug 22, 2024

geyslan requested review from rscampos, yanivagman and NDStrahilevitz August 22, 2024 21:06

geyslan self-assigned this Aug 22, 2024

github-actions bot added kind/documentation area/testing area/UX area/flags and removed area/performance labels Aug 22, 2024

geyslan added 4 commits August 23, 2024 09:14

chore: benchmark proctree

36ca022

perf: remove mutex, add concurrency tests

976704f

Public proctree methods already make use of the mutex of the inner lru cache. This commit removes the outer mutex and adds concurrency tests to ensure that the proctree is safe to use concurrently.

geyslan force-pushed the proctree branch from 6ba23bc to 4b89620 Compare August 23, 2024 12:14

rscampos approved these changes Aug 23, 2024

View reviewed changes

geyslan merged commit 95d2ff1 into aquasecurity:main Aug 23, 2024
30 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proctree improvements (RSS/Performance) #4261

Proctree improvements (RSS/Performance) #4261

geyslan commented Aug 22, 2024 •

edited

Loading

rscampos commented Aug 23, 2024 •

edited

Loading

rscampos left a comment

Proctree improvements (RSS/Performance) #4261

Proctree improvements (RSS/Performance) #4261

Conversation

geyslan commented Aug 22, 2024 • edited Loading

1. Explain what the PR does

2. Explain how to test it

Run tracee

Stress it

Measurements

Measurements Conclusion

3. Other comments

rscampos commented Aug 23, 2024 • edited Loading

rscampos left a comment

Choose a reason for hiding this comment

geyslan commented Aug 22, 2024 •

edited

Loading

rscampos commented Aug 23, 2024 •

edited

Loading